{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression Tutorial\n",
    "\n",
    "This guide will show how to use Tribuo’s regression models to predict wine quality based on the [UCI Wine Quality](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) data set. We’ll experiment with several different regression trainers: two for training linear models (SGD and Adagrad) and one for training a tree ensemble via Tribuo’s wrapper on XGBoost (note: Tribuo's XGBoost support relies upon the Maven Central XGBoost jar which only contains binaries for x86_64 platforms). We’ll run these experiments by simply swapping in different implementations of Tribuo’s `Trainer` interface. We’ll also show how to evaluate regression models and describe some common evaluation metrics.\n",
    "\n",
    "## Setup\n",
    "\n",
    "First you'll need to download the winequality dataset from UCI:\n",
    "\n",
    "`wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv`\n",
    "\n",
    "then we'll load in some jars and import a few packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%jars ./tribuo-json-4.3.0-jar-with-dependencies.jar\n",
    "%jars ./tribuo-regression-sgd-4.3.0-jar-with-dependencies.jar\n",
    "%jars ./tribuo-regression-xgboost-4.3.0-jar-with-dependencies.jar\n",
    "%jars ./tribuo-regression-tree-4.3.0-jar-with-dependencies.jar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import java.nio.file.Path;\n",
    "import java.nio.file.Paths;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.tribuo.*;\n",
    "import org.tribuo.data.csv.CSVLoader;\n",
    "import org.tribuo.datasource.ListDataSource;\n",
    "import org.tribuo.evaluation.TrainTestSplitter;\n",
    "import org.tribuo.math.optimisers.*;\n",
    "import org.tribuo.regression.*;\n",
    "import org.tribuo.regression.evaluation.*;\n",
    "import org.tribuo.regression.sgd.RegressionObjective;\n",
    "import org.tribuo.regression.sgd.linear.LinearSGDTrainer;\n",
    "import org.tribuo.regression.sgd.objectives.SquaredLoss;\n",
    "import org.tribuo.regression.rtree.CARTRegressionTrainer;\n",
    "import org.tribuo.regression.xgboost.XGBoostRegressionTrainer;\n",
    "import org.tribuo.util.Util;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "In Tribuo, all the prediction types have an associated `OutputFactory` implementation, which can create the appropriate `Output` subclasses from an input. Here we're going to use `RegressionFactory` as we're performing regression. In Tribuo both single and multidimensional regression use the `Regressor` and `RegressionFactory` classes. We then pass the `regressionFactory` into the simple `CSVLoader` which reads all the columns into a `DataSource`. The winequality dataset uses `;` to separate the columns rather than the standard `,` so we change the default separator character. Note if your csv file isn't purely numeric or you wish to use a subset of the columns as features then you should use `CSVDataSource` which allows fine-grained control over the loading and featurisation process of your csv file. There's a columnar data tutorial which details the flexibility and power of our columnar processing infrastructure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "var regressionFactory = new RegressionFactory();\n",
    "var csvLoader = new CSVLoader<>(';',regressionFactory);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We don't have a pre-defined train test split, so we take 70% as the training data, and 30% as the test data. The data is randomised using the RNG seeded by the second value. Then we feed the split data sources into the training and testing datasets. These `MutableDataset`s manage all the metadata (e.g., feature & output domains), and the mapping from feature names to feature id numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "var wineSource = csvLoader.loadDataSource(Paths.get(\"winequality-red.csv\"),\"quality\");\n",
    "var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);\n",
    "Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());\n",
    "Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression in Tribuo\n",
    "Unlike most ML packages, regression in Tribuo is multidimensional by default. This means that each `Regressor` contains a vector of named values, which like the `Feature` objects are always kept sorted in lexicographic order. However unlike features, `Regressor` objects are dense, they always include all the dimensions, even if some are zero. In practice for an output type this isn't a strong restriction as if you're working in a multidimensional space then all dimensions will usually be present, and there are many fewer output dimensions than there are features.\n",
    "\n",
    "Given this difference from other libraries, you might as why Tribuo does it this way? It's because it makes operating on probability distributions using regression algorithms significantly simpler. We do this in Tribuo's implementation of [LIME](https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf) for classification model explanations, and it will make future implementations of gradient boosting and similar algorithms much easier.\n",
    "\n",
    "How does this affect users of Tribuo's regression package? Well, each `Regressor` is an `Iterable<DimensionTuple>`, and a `DimensionTuple` represents the name of a dimension, along with it's regressed value and a variance if present (unknown variances are set to `Double.NaN`). If you don't name the dimensions during data loading then they are automatically named `DIM-0`, ... `DIM-N` where `N` is one less than the number of dimensions. This means in the common case of single dimensional regression you'll want to access the first element of the various state accessors, or by asking the `Regressor` for `DIM-0`, or index `0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Num dimensions = 1\n",
      "Dimension name: DIM-0\n",
      "Dimension value: 6.0\n",
      "Tuple = [DIM-0=6.0]\n",
      "Regressor[0] = DIM-0=6.0\n"
     ]
    }
   ],
   "source": [
    "Regressor r = trainData.getExample(0).getOutput();\n",
    "System.out.println(\"Num dimensions = \" + r.size());\n",
    "\n",
    "String[] dimNames = r.getNames();\n",
    "System.out.println(\"Dimension name: \" + dimNames[0]);\n",
    "\n",
    "double[] regressedValues = r.getValues();\n",
    "System.out.println(\"Dimension value: \" + regressedValues[0]);\n",
    "\n",
    "// getDimension(String) returns an Optional<DimensionTuple>\n",
    "Regressor.DimensionTuple tuple = r.getDimension(\"DIM-0\").get();\n",
    "System.out.println(\"Tuple = [\" + tuple +\"]\");\n",
    "\n",
    "// getDimension(int) throws IndexOutOfBoundsException if you give it a negative index\n",
    "// or one greater than or equal to r.size()\n",
    "Regressor.DimensionTuple tupleI = r.getDimension(0);\n",
    "System.out.println(\"Regressor[0] = \" + tupleI);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prediction objects produced by Tribuo's regression models contain a single `Regressor` with a value for each dimension output the model knows about. As each `Regressor` represents a full vector there is no need for a collection of them to represent the full output space, unlike `Label` in multi-class classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the models\n",
    "We're going to define a quick training function which accepts a trainer and a training dataset. It times the training and also prints the performance metrics. Evaluating on the training data is useful for debugging: if the model performs poorly in the training data, then we know something is wrong with either our model or our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "public Model<Regressor> train(String name, Trainer<Regressor> trainer, Dataset<Regressor> trainData) {\n",
    "    // Train the model\n",
    "    var startTime = System.currentTimeMillis();\n",
    "    Model<Regressor> model = trainer.train(trainData);\n",
    "    var endTime = System.currentTimeMillis();\n",
    "    System.out.println(\"Training \" + name + \" took \" + Util.formatDuration(startTime,endTime));\n",
    "    // Evaluate the model on the training data\n",
    "    // This is a useful debugging tool to check the model actually learned something\n",
    "    RegressionEvaluator eval = new RegressionEvaluator();\n",
    "    var evaluation = eval.evaluate(model,trainData);\n",
    "    // We create a dimension here to aid pulling out the appropriate statistics.\n",
    "    // You can also produce the String directly by calling \"evaluation.toString()\"\n",
    "    var dimension = new Regressor(\"DIM-0\",Double.NaN);\n",
    "    System.out.printf(\"Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n\",\n",
    "            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));\n",
    "    return model;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to define an equivalent testing function which accepts a model and a test dataset, printing the performance to std out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "public void evaluate(Model<Regressor> model, Dataset<Regressor> testData) {\n",
    "    // Evaluate the model on the test data\n",
    "    RegressionEvaluator eval = new RegressionEvaluator();\n",
    "    var evaluation = eval.evaluate(model,testData);\n",
    "    // We create a dimension here to aid pulling out the appropriate statistics.\n",
    "    // You can also produce the String directly by calling \"evaluation.toString()\"\n",
    "    var dimension = new Regressor(\"DIM-0\",Double.NaN);\n",
    "    System.out.printf(\"Evaluation (test):%n  RMSE %f%n  MAE %f%n  R^2 %f%n\",\n",
    "            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll define the four trainers we're going to compare.\n",
    "- A linear regression trained using linear decay SGD.\n",
    "- A linear regression trained using SGD and AdaGrad.\n",
    "- A regression tree using the CART algorithm with a maximum depth of 6.\n",
    "- An XGBoost trainer using 50 rounds of boosting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "var lrsgd = new LinearSGDTrainer(\n",
    "    new SquaredLoss(), // loss function\n",
    "    SGD.getLinearDecaySGD(0.01), // gradient descent algorithm\n",
    "    10,                // number of training epochs\n",
    "    trainData.size()/4,// logging interval\n",
    "    1,                 // minibatch size\n",
    "    1L                 // RNG seed\n",
    ");\n",
    "var lrada = new LinearSGDTrainer(\n",
    "    new SquaredLoss(),\n",
    "    new AdaGrad(0.01),\n",
    "    10,\n",
    "    trainData.size()/4,\n",
    "    1,\n",
    "    1L \n",
    ");\n",
    "var cart = new CARTRegressionTrainer(6);\n",
    "var xgb = new XGBoostRegressionTrainer(50);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we'll train the linear regression with SGD:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training Linear Regression (SGD) took (00:00:00:051)\n",
      "Evaluation (train):\n",
      "  RMSE 0.979522\n",
      "  MAE 0.741870\n",
      "  R^2 -0.471611\n"
     ]
    }
   ],
   "source": [
    "var lrsgdModel = train(\"Linear Regression (SGD)\",lrsgd,trainData);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the models\n",
    "Using our evaluation function this is pretty straightforward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation (test):\n",
      "  RMSE 0.967450\n",
      "  MAE 0.720619\n",
      "  R^2 -0.439255\n"
     ]
    }
   ],
   "source": [
    "evaluate(lrsgdModel,evalData);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those numbers seem poor, but what do these evaluation metrics mean?\n",
    "\n",
    "### RMSE\n",
    "\n",
    "The root-mean-square error (RMSE) summarizes the magnitude of errors between our regression model's predictions and the values we observe in our data. Basically, RMSE is the standard deviation of model prediction errors on a given dataset.\n",
    "\n",
    "$$RMSE = \\sqrt{ \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 }$$\n",
    "\n",
    "Lower is better: a perfect model for the wine data would have RMSE=0. The RMSE is sensitive to how large an error was, and is thus sensitive to outliers. This also means that RMSE can be used to compare different models on the same dataset but not across different datasets, as a \"good\" RMSE value on one dataset might be larger than a \"good\" RMSE value on a different dataset. See [Wikipedia](https://en.wikipedia.org/wiki/Root-mean-square_deviation) for more info on RMSE.\n",
    "\n",
    "### MAE\n",
    "\n",
    "The mean absolute error (MAE) is another summary of model error. Unlike RMSE, each error in MAE contributes proportional to its absolute value. \n",
    "\n",
    "$$MAE = \\frac{1}{n} \\sum_{i=1}^{n} |y_i - \\hat{y}_i|$$\n",
    "\n",
    "### R^2\n",
    "\n",
    "The R-squared metric (also called the \"coefficient of determination\") summarizes how much of the variation in observed outcomes can be explained by our model. \n",
    "\n",
    "Let $\\bar{y} = \\frac{1}{n} \\sum_{i=1}^{n} y_i$, i.e., the mean deviation of observed data points from the observed mean. R^2 is given by:\n",
    "\n",
    "$$R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}$$\n",
    "\n",
    "A value of R^2=1 means that the model accounts for all of the variation in a set of observations -- in other words, it fits a dataset perfectly. Note that R^2 can turn negative when the sum-of-squared model errors (numerator) is greater than the sum-of-squared differences between observed data points and the observed mean (denominator). In other words, when R^2 is negative, the model fits the data *worse* than simply using the observed mean to predict values.\n",
    "\n",
    "See [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination) and the [Minitab blog](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit) for more detailed discussion of R^2.\n",
    "\n",
    "## Improving over standard SGD with AdaGrad\n",
    "\n",
    "It's not surprising the SGD results are bad: in linear decay SGD, the step size used for parameter updates changes over time (training iterations) but is uniform across all model parameters. This means that we use the same step size for a noisy/irrelevant feature as we would for an informative feature. There are many more sophisticated approaches to stochastic gradient descent. \n",
    "\n",
    "One of these is [AdaGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf), which modifies the \"global\" learning rate for each parameter $p$ using the sum-of-squares of past gradients w.r.t. $p$, up to time $t$. \n",
    "\n",
    "> ... the secret sauce of AdaGrad is not on necessarily accelerating gradient descent with a better step size selection, but making gradient descent more stable to not-so-good \\(\\eta\\) choices.\n",
    "> Anastasios Kyrillidis, [Note on AdaGrad](http://akyrillidis.github.io/notes/AdaGrad)\n",
    "\n",
    "Let's try training for the same number of epochs using `AdaGrad` instead of `LinearDecaySGD`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training Linear Regression (AdaGrad) took (00:00:00:024)\n",
      "Evaluation (train):\n",
      "  RMSE 0.735311\n",
      "  MAE 0.575096\n",
      "  R^2 0.170709\n",
      "Evaluation (test):\n",
      "  RMSE 0.737994\n",
      "  MAE 0.585709\n",
      "  R^2 0.162497\n"
     ]
    }
   ],
   "source": [
    "var lradaModel = train(\"Linear Regression (AdaGrad)\",lrada,trainData);\n",
    "evaluate(lradaModel,evalData);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using a more robust optimizer got us a better fit in the same number of epochs. However, both the train and test R^2 scores are still substantially less than 1 and, as before, the train and test RMSE scores are very similar.\n",
    "\n",
    "See [here](http://akyrillidis.github.io/notes/AdaGrad) and [here](http://ruder.io/optimizing-gradient-descent/index.html#adagrad) for more on AdaGrad. Also, there are many other implementations of various well-known optimizers in Tribuo, including [Adam](https://tribuo.org/learn/4.1/javadoc/org/tribuo/math/optimisers/Adam.html) and [RMSProp](https://tribuo.org/learn/4.1/javadoc/org/tribuo/math/optimisers/RMSProp.html). See the [math.optimisers](https://tribuo.org/learn/4.1/javadoc/org/tribuo/math/optimisers/package-summary.html) package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, we showed that we can improve our model by using a more robust optimizer; however, we're still using a linear model. If there are informative, non-linear relationships among wine quality features, then our current model won't be able to take advantage of them. We'll finish this tutorial by showing how to use a couple of popular non-linear models, CART and [XGBoost](https://xgboost.ai).\n",
    "\n",
    "## Trees and ensembles\n",
    "\n",
    "Next we'll train the CART tree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training CART took (00:00:00:092)\n",
      "Evaluation (train):\n",
      "  RMSE 0.544516\n",
      "  MAE 0.405062\n",
      "  R^2 0.545236\n",
      "Evaluation (test):\n",
      "  RMSE 0.658722\n",
      "  MAE 0.494395\n",
      "  R^2 0.332754\n"
     ]
    }
   ],
   "source": [
    "var cartModel = train(\"CART\",cart,trainData);\n",
    "evaluate(cartModel,evalData);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we'll train the XGBoost ensemble:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training XGBoost took (00:00:00:194)\n",
      "Evaluation (train):\n",
      "  RMSE 0.143871\n",
      "  MAE 0.097167\n",
      "  R^2 0.968252\n",
      "Evaluation (test):\n",
      "  RMSE 0.599478\n",
      "  MAE 0.426673\n",
      "  R^2 0.447378\n"
     ]
    }
   ],
   "source": [
    "var xgbModel = train(\"XGBoost\",xgb,trainData);\n",
    "evaluate(xgbModel,evalData);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using gradient boosting via XGBoost improved results by a lot. Not only are the train & test fits better, but the train and test RMSE have started to diverge a little, indicating that the XGBoost model isn't underfitting like the two linear models were. XGBoost won't always be the best model for your data, but it's often a great baseline model to try when facing a new problem or dataset.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "In this tutorial, we showed how to experiment with several different regression trainers (linear decay SGD, AdaGrad, CART, XGBoost). It was easy to experiment with different trainers and models by simply swapping in different implementations of the Tribuo `Trainer` interface. We also showed how to evaluate regression models and described some common evaluation metrics."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Java",
   "language": "java",
   "name": "java"
  },
  "language_info": {
   "codemirror_mode": "java",
   "file_extension": ".jshell",
   "mimetype": "text/x-java-source",
   "name": "Java",
   "pygments_lexer": "java",
   "version": "12+33"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
